UC Davis DataLab Install Guide
Overview
This toolkit will walk you through installing some of the most common data science tools used at the UC Davis DataLab and beyond. Use the sidebar on the left to navigate to the specific program you would like to install. If you are taking a class or workshop using these tools, please install them prior to attending.
Anaconda (Python)
Introduction
Anaconda on Windows
Anaconda on Mac
Verifying your install
Installation troubleshooting
If you are not able to successfully install Anaconda on your own, please attend DataLab’s Virtual Office Hours. Click here for more information and to receive a Zoom link.
DBeaver (SQL Database)
Introduction
DBeaver on Windows
DBeaver on Mac
Verifying your install
Installation troubleshooting
If you are not able to successfully install DBeaver on your own, please attend DataLab’s Virtual Office Hours. Click here for more information and to receive a Zoom link.
Git
Introduction
Git is ubiquitous software for version control. Version control is the process of storing and organizing multiple versions (or copies) of documents that you create. Approaches to version control range from simple to complex and can involve human workflows, software applications, or both, all with the goal of storing and managing multiple versions of the same document(s).
Most people have a folder/directory somewhere on their computer that looks something like this:
Or perhaps, this:
This is a rudimentary form of version control that relies completely on the human workflow of saving multiple versions of a file. This system works minimally well, in that it does provide you with a history of file versions theoretically organized by their time sequence. But this filesystem method provides no information about how the file has changed from version to version, why you might have saved a particular version, or specifically how the various versions are related. This human-managed filesystem approach is more subject to error than software-assisted version control systems. It is not uncommon for users to make mistakes when naming file versions, or to go back and edit files out of sequence. Software-assisted version control systems such as Git were designed to solve this problem.
If you would like more information on how Git works, read more in the Git Book for free.
Git on Windows
Follow these step-by-step instructions if you’re installing Git on a Windows machine:
First, launch a web browser. The image below shows the Microsoft Edge browser:
Next, navigate to the Git download page at https://git-scm.com/downloads in your browser:
Select “Windows” from the Downloads portion of the Git webpage. The site will display the following page and automatically begin downloading the correct version of the Git software. If the download doesn’t start automatically, click on the “Click here to download manually” link:
When the download is complete, open or run the downloaded file (this will look slightly different depending on your browser):
A screen will appear asking for permissions for the Git application to make changes to your device. Click on the Yes button:
Click Next to accept the user license:
Leave the default “Destination Location” unchanged (usually C:\Program Files\Git) and hit Next:
You will see a screen like the one below asking you to “Select Components”:
Leave all of the default components selected and also check the boxes next to “Additional Icons” and its sub-item, “On the Desktop.” Your completed configurations window should have the following components selected:
Additional Icons
-> On the Desktop
Windows Explorer integration
-> Git Bash Here
-> Git GUI Here
Git LFS (Large File Support)
Associate .git* configuration files with default text editor
Associate .sh files to be run with Bash
And should look like this:
After verifying that you have the necessary components selected as per above, hit Next.
The next screen will ask you to “Select a Start Menu Folder.” Keep the default value of Git and hit Next:
Leave the default “Use Vim (the ubiquitous text editor) as Git’s default editor” selected on the “Choosing the default editor used by Git” screen and hit Next:
On the next screen, leave the default “let Git decide” option selected and hit Next:
Leave the default “Git from the command line and also from 3rd-party software” selected and hit Next:
On the next “Choosing HTTPS transport backend” page leave the default “Use the OpenSSL library” option selected and hit Next:
Leave the default “Checkout Windows-style, commit Unix-style line endings” selected on the next page and hit Next:
Keep the default “Use MinTTY (the default terminal of MSYS2)” selected on the “Configuring the terminal emulator to use with Git Bash” window and hit Next:
Keep the default value of “Default (fast-forward or merge)” on the “Choose the default behavior of ‘git pull’” page and hit Next:
Keep the default value of “Git Credential Manager Core” on the “Choose a credential helper” page and hit Next:
Keep the default values on the “Configuration extra options” page by keeping “Enable file system caching” checked and “Enable symbolic links” unchecked and then hit Next:
Make sure that no options are checked in the “Configuring experimental options” page and hit Install:
After you hit this Install button as per above, you will see an install progress screen like the one below:
When the install is complete, a new, “Completing the Git Setup Wizard” window like the one below will appear:
Make sure that all of the options on this window are unchecked as in the image below and then hit the Finish button:
This will complete your installation process.
Windows users should verify that Git Bash was installed along with Git for Windows, since it is necessary for working with Git on the command line.
Git on Mac
If you are installing Git on a Mac, there is no extra configuration. Simply go to the Git download page at https://git-scm.com/downloads, choose the latest version for Mac, and run the installer package when it has finished downloading. If you get an “unknown developer” warning during the install process, follow the instructions at the beginning of the video at https://www.youtube.com/watch?v=__kr-Ew5kbE to help you work through this problem.
Verifying your install
Whether you’re installing on Windows or Mac, note that unlike most applications that you’ve installed before, you will not find a “Git” application in your programs or applications directory once the installation is complete. As long as you don’t get an explicit error message during the installation process, you can assume that the software was successfully installed. Git is a command-line application, and we’ll cover how to interact with it during the interactive session. If you’re already familiar with the command line, you can verify your install by opening the terminal (on Windows, that will be Git Bash in your programs menu) and typing git --version. You should then see your installed version in response (e.g., git version 2.12.2.windows.2, or git version 2.12.2.mac.2), and not the error “command not found.”
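If you already have Python available (for example through Anaconda), the same check can be scripted. The following is only a minimal sketch, not part of the official instructions: the function name git_version is made up for illustration, and it simply runs the same git --version command described above.

```python
import shutil
import subprocess


def git_version():
    """Return the output of `git --version`, or None if git is not on the PATH."""
    # shutil.which looks the executable up on the PATH, much like a terminal does.
    if shutil.which("git") is None:
        return None
    result = subprocess.run(
        ["git", "--version"], capture_output=True, text=True, check=False
    )
    # An empty response is treated the same as "not found".
    return result.stdout.strip() or None


if __name__ == "__main__":
    version = git_version()
    print(version if version else "git: command not found")
```

A response beginning with git version means the install worked. A "command not found" message usually means Git is not installed, or the terminal was opened before the install finished.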
Installation troubleshooting
If you are not able to successfully install Git on your own, please attend DataLab’s Virtual Office Hours. Click here for more information and to receive a Zoom link.
Jupyter Notebooks
Introduction
Jupyter on Windows
Jupyter on Mac
Verifying your install
Installation troubleshooting
If you are not able to successfully install Jupyter Notebooks on your own, please attend DataLab’s Virtual Office Hours. Click here for more information and to receive a Zoom link.
Linux Subsystem for Windows (LSW; a.k.a. Command Line)
Introduction
LSW on Windows
Verifying your install
Installation troubleshooting
If you are not able to successfully install the Linux Subsystem for Windows on your own, please attend DataLab’s Virtual Office Hours. Click here for more information and to receive a Zoom link.
OpenRefine
Introduction
OpenRefine is an open source tool used to clean and pre-process messy data. While most people are familiar with data cleaning in their coding tool of choice (R, Python, Julia, etc.), OpenRefine is designed to provide powerful cleaning capabilities with minimal overhead. One of the most helpful capabilities of OpenRefine is the ability to check for possible duplicates and misspellings of text data using its text facet tools.
OpenRefine on Windows
Open your web browser of choice and navigate to the OpenRefine homepage at https://openrefine.org/. Click on the download button in the left sidebar.
On the download page, scroll to the latest version of OpenRefine and select the Windows kit. If you are unsure if you have Java installed on your system, choose the Windows kit with embedded Java instead.
Once the download has completed, open the zip and move the contents to a convenient location on your computer.
Open the resulting directory, and double click on the openrefine.exe executable.
The OpenRefine executable will start a terminal window, and shortly after launch a tab in your default web browser with the OpenRefine interface.
OpenRefine on Mac
If you are installing OpenRefine on a Mac, there is no extra configuration. Simply go to the download page for OpenRefine and choose the latest version for Mac. Run the installer package when it has finished downloading. If you receive an error regarding the app being from an unidentified developer, please follow the instructions here.
Verifying your install
To verify everything is working, first start OpenRefine. It will open a page in your browser of choice that resembles the following.
Click the Choose Files button, and enter this dataset (you can just put in the URL). Click Next.
OpenRefine will load in the data and present you with a preview. The defaults should be fine. Click Create Project in the upper right hand corner.
You will then be presented with the OpenRefine working area. Click the arrow next to the What sector ... column and select Facet -> Text facet.
In the left hand menu, click the Cluster button.
In the following menu, for method select nearest neighbor. OpenRefine will look through that column for any strings that are similar and show them to you. This can be helpful for finding typos. Here, we see there are two misspellings of “Academia.” Click the check-box in the Merge? column, then select Merge Selected & Close. OpenRefine will change all strings in the Values in Cluster column to match the New Cell Value. If that all worked, OpenRefine is working!
Installation troubleshooting
If you are not able to successfully install OpenRefine on your own, please attend DataLab’s Virtual Office Hours. Click here for more information and to receive a Zoom link.
Python Modules
Introduction
Python modules are extensions to the basic capabilities of Python. You can install modules from the terminal where you call Python.
Installing Modules using Conda
If you are using Anaconda, it is recommended you install modules from conda sources to ensure compatibility. First, start a terminal using the Python environment you want to install the module in.
(Optional) If you are using a graphical interface, use the Anaconda Navigator to launch a terminal of your chosen environment.
Once you are in the terminal of your chosen environment, you can install any module you know the name of using conda install <module name>.
conda will then search its repositories for a module matching the name you provided. Note that the module name Python reports as missing in an error message may not be the name you need to use when installing! If the install fails because no module of that name can be found, you will need to search online for the correct name.
Installing Modules using Pip
If you are not using conda, or conda could not find a module, you will need to use pip to install modules. First, start a terminal using the Python environment you want to install the module in.
Once you are in the terminal of your chosen environment, you can install any module you know the name of using pip install <module name>.
pip will then search its repositories for a module matching the name you provided. Note that the module name Python reports as missing in an error message may not be the name you need to use when installing! If the install fails because no module of that name can be found, you will need to search online for the correct name.
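To make the naming mismatch concrete, here is a small illustrative sketch. The mapping lists a few well-known cases where the name used in import differs from the name used with pip install. It is not exhaustive, the helper install_name is hypothetical, and conda package names can differ again, so treat the result as a starting point for a search rather than a guarantee.

```python
# Illustrative, non-exhaustive mapping from the name used in `import ...`
# to the name used with `pip install ...`.
IMPORT_TO_INSTALL = {
    "sklearn": "scikit-learn",
    "cv2": "opencv-python",
    "PIL": "Pillow",
    "yaml": "PyYAML",
    "bs4": "beautifulsoup4",
}


def install_name(import_name):
    """Return a likely install name for a module name from an import error.

    Falls back to the import name itself, which is right for many
    packages (for example numpy or pandas) but is only a guess.
    """
    # Only the top-level package matters: "sklearn.cluster" -> "sklearn".
    top_level = import_name.split(".")[0]
    return IMPORT_TO_INSTALL.get(top_level, top_level)


print(install_name("sklearn"))  # scikit-learn
print(install_name("numpy"))    # numpy
```

For example, if Python reports No module named 'sklearn', the command to run is pip install scikit-learn, not pip install sklearn.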
Verifying your install
Regardless of your installation method, you can test whether a module was successfully installed by doing the following. First, start Python by entering python in a terminal that is running in the environment you installed the module into. You know you are in Python if you see the welcome message and your prompt changes to >>>.
Once you have started Python, you can test whether your module was installed by trying to import it. To import a module, enter import <module name> in Python.
If you did not get an error, your module is successfully installed! You can also give modules an alias when importing for easier use.
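As a rough sketch of both ideas, the snippet below checks whether a module can be found and then imports one under an alias. The helper is_installed is a hypothetical name, and json is used only because it ships with Python. Substitute the module you actually installed.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    # find_spec returns None when no module of that name can be found.
    return importlib.util.find_spec(module_name) is not None


# `json` ships with Python, so it stands in for a module you just installed.
print(is_installed("json"))

# Importing with an alias: `js` is now a shorter name for the same module.
import json as js

print(js.dumps({"installed": True}))
```

An alias only changes the name you type afterwards. The module itself is the same.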
Installation troubleshooting
If you are not able to successfully install Python modules on your own, please attend DataLab’s Virtual Office Hours. Click here for more information and to receive a Zoom link.
R/RStudio
Introduction
“R” is both a free and open source programming language designed for statistical computing and graphics, and the software for interpreting the code written in the R language. RStudio is an integrated development environment (IDE) within which you can write and execute code, and interact with the R software. It’s an interface for working with the R software that allows you to see your code, plots, variables, etc. all on one screen. This functionality can help you work with R, connect it with other tools, and manage your workspace and projects. You cannot run RStudio without having R installed. While RStudio is a commercial product, the free version is sufficient for most researchers.
Why learn R? There are many advantages to working with R.
- Scientific integrity. Working with a scripting language like R facilitates reproducible research. Having the commands for an analysis captured in code promotes transparency and reproducibility. Someone using your code and data should be able to exactly reproduce your analyses. An increasing number of research journals not only encourage, but are beginning to require, submission of code along with a manuscript.
- Many data types and sizes. R was designed for statistical computing and thus incorporates many data structures and types to facilitate analyses. It can also connect to local and cloud databases.
- Graphics. R has built-in plotting functionalities that allow you to adjust any aspect of your graph to effectively tell the story of your data.
- Open and cross-platform. Because R is free, open-source software that works across many different operating systems, anyone can inspect the source code, and report and fix bugs. It is supported by a large community of users and developers.
- Interdisciplinary and extensible. Because anyone can write and share R packages, it provides a framework for integrating approaches across domains, encouraging innovation.
R/RStudio on Windows
Follow these step-by-step instructions to install R and RStudio on a Windows machine:
First open your internet browser of choice, and navigate to https://www.r-project.org/. Click on download R.
On the following page, select the link under whatever location is closest to you for the best download speed (though any will work).
Next, click the Download R for Windows link.
Click on the link base to go to the download page.
Finally, click Download R X.X.X for Windows to download the installer.
When the download is complete, run the R installer. This will look slightly different depending on your browser.
Select your language and then accept the license agreement by hitting Next >.
Leave the default install location and select Next >.
If you know what kind of machine you are on, you can specify if you want the 32 or 64 bit version of R. If you do not know, it is safe to install both.
Keep the default startup options and hit Next >.
You most likely will not want an R shortcut on your desktop, as you will almost certainly use RStudio as an interface. You can still have one if you would like. Otherwise, accept the defaults and hit Next >.
Wait for the installation to complete.
Once it is done, hit Finish. You’ve now installed R! However, we still need to install RStudio separately.
Navigate to the RStudio homepage at https://rstudio.com/ and click the download button.
Scroll down and select the free version to download. If you are using RStudio for commercial purposes you will need to look into RStudio’s licensing terms to see if you need to pay for the pro version.
Download the RStudio installer for your machine.
Run the installer just as you did for the R download.
R/RStudio on Mac
If you are installing R/RStudio on a Mac, there is no extra configuration. Simply go to the download pages for R and RStudio and choose the latest version for Mac. Run the installer package when it has finished downloading. If you receive an error regarding the app being from an unidentified developer, please follow the instructions here.
Verifying your install
Once you have installed both R and RStudio, you should be able to run RStudio on your machine. You can verify your install is working by opening RStudio and typing paste("Hello World!") into the console as shown below. If the code runs you should see a response that says [1] "Hello World!". If that works you are all set!
Installation troubleshooting
If you are not able to successfully install R/RStudio on your own, please attend DataLab’s Virtual Office Hours. Click here for more information and to receive a Zoom link.
R Packages
Introduction
R packages are community made functions that automate or expand the things you can do in the R language. The process for installing them is largely the same for both Windows and Mac. There are three main methods for installing R packages. You can install them from the Comprehensive R Archive Network (CRAN) from within R, from another online source like Github, or from files on your local machine. We will cover each of these methods here.
Packages from CRAN
The vast majority of packages in R can be installed from CRAN. You will need the name of the package you want to install. Once you have the name (case sensitive!), you can download it from within R using the install.packages() function in the R console.
For example, if you wanted to install the skimr package, you would enter install.packages("skimr"). Note that you do need to put the package name in quotes, and that it is case sensitive. You can install multiple packages at a time by passing a vector of package names to install.packages(), for example: install.packages(c("skimr", "corrplot")).
You can test whether a package installed correctly by loading it in R using the library() function.
Packages from Github
Not all packages are available on CRAN, especially very new or very old packages. For these you will most likely need to install them from an online repository, the most common of which is Github. There are some R packages that make this process easy. We will use the remotes package.
First install the remotes package by running install.packages("remotes") in the R console. If it installed correctly, you will see package ‘remotes’ successfully unpacked and MD5 sums checked.
Once remotes is installed, you can install packages directly from an online repository. We will use remotes’s install_github() function to demonstrate. Let’s try installing the handy wordcountaddin for RStudio. First navigate to the package page on Github: https://github.com/benmarwick/wordcountaddin.
Most packages that need to be installed from an online source will give you the code to do so in the readme of their repository. Another common package for installing from online repositories is called devtools, but this package contains many other functions we don’t need for this task. You can replace devtools with remotes almost all of the time.
For our case, type remotes::install_github("benmarwick/wordcountaddin", type = "source", dependencies = TRUE) into the console and hit enter. You may be asked if you would like to update other packages. Enter 1 to try and update all. If this doesn’t work, you can enter the command again and try 3 to ignore these updates.
Read the resulting install output. As long as it does not indicate anything failed to install, you’re done!
You can now use the helpful wordcountaddin! You can find it in the Addins dropdown above the editor pane.
Packages from Local Source
On occasion it will be necessary to install an R package from a local source .tar.gz file. You would need to download this file yourself, then install it. For example, let’s download then install the mgcv package from https://cran.r-project.org/web/packages/mgcv/index.html. First navigate to that page, and click on the download link next to package source.
Once the .tar.gz file is downloaded, leave it as it is and do not decompress it. Instead, find the file on your system and take note of its location. On Windows you can shift-right click on it, then select Copy as path.
In your R console, type in install.packages("path/to/your/source_file.tar.gz", repos = NULL, type = "source"). R will start installing the package from the local copy.
Two things to note for Windows users:
You will need to have installed R tools on your machine. There is a guide for that lower in this document.
If you are on Windows, you will need to make sure your file path uses forward slashes (/) rather than back slashes (\).
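If editing the path by hand is error-prone, the conversion can also be scripted. The sketch below happens to use Python rather than R, and the path and the helper name to_forward_slashes are hypothetical. It assumes the copied path arrives wrapped in double quotes, which is typically what Copy as path produces.

```python
from pathlib import PureWindowsPath

# Hypothetical path, pasted as copied from Windows (note the surrounding quotes).
copied = r'"C:\Users\me\Downloads\source_file.tar.gz"'


def to_forward_slashes(windows_path):
    """Strip surrounding double quotes and convert back slashes to forward slashes."""
    return PureWindowsPath(windows_path.strip('"')).as_posix()


print(to_forward_slashes(copied))
# C:/Users/me/Downloads/source_file.tar.gz
```

The printed path can then be pasted, in quotes, as the first argument to install.packages().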
R Tools
Introduction
R Tools is a bundle of programs on Windows that allows R to build packages from local source files, rather than installing through CRAN. The vast majority of the time this is unnecessary, but some circumstances require it. Mac users do not need to install R Tools.
RTools on Windows
First, navigate to the R tools website at https://cran.r-project.org/bin/windows/Rtools/.
Scroll down to the download links and select the rtoolsXX-x86_64.exe link.
Once the download has finished, run the R tools installer.
On the first page of the installer, select a custom location if you would like, but otherwise press Next >.
Leave the defaults for the additional tasks and hit Next >.
On the next page, hit the install button and wait for it to complete.
Once the install is done, press Finish. You’re not done yet though!
In order for R to make use of R tools, you need to add it to the PATH that R looks for tools on. To do this, open R or RStudio and type writeLines('PATH="${RTOOLS40_HOME}\\usr\\bin;${PATH}"', con = "~/.Renviron") in the console and press enter.
Verifying your install
To verify R tools was installed successfully, first restart R to ensure you are in a clean environment. You can do this by closing and re-opening your R or RStudio window. Afterwards, type Sys.which("make") (case sensitive!) into the console. If you see "C:\\rtools40\\usr\\bin\\make.exe" as a result you are all good!
Installation troubleshooting
If you are not able to successfully install R Tools on your own, please attend DataLab’s Virtual Office Hours. Click here for more information and to receive a Zoom link.
Contributions
This research toolkit is maintained by the UC Davis DataLab, and is open for contribution. See how you can contribute on the Github repo.
This toolkit has been made possible thanks to contributions by:
- Carl Stahmer
- Jared Joseph